18 research outputs found

    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our experiments with existing multilingual large language models point to the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes
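
    The abstract does not say how lexical diversity was measured. As a rough illustration only, the sketch below computes a type-token ratio (distinct words divided by total words), one common proxy for lexical diversity. The function names and the crude regex tokenizer are assumptions for this example, not the authors' method.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word-like tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text: str) -> float:
    """Return distinct tokens / total tokens; 0.0 for empty input."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample sentences, not taken from any corpus in the paper.
repetitive = "the cat sat on the mat the cat sat"
varied = "a farmer sells rice at the morning market"

print(type_token_ratio(repetitive))  # 5 distinct tokens out of 9
print(type_token_ratio(varied))      # every token is distinct
```

    The ratio shrinks as texts get longer, so a comparison across corpora would normally use samples of equal length.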

    A Genetic Locus within the FMN1/GREM1 Gene Region Interacts with Body Mass Index in Colorectal Cancer Risk

    Colorectal cancer risk can be impacted by genetic, environmental, and lifestyle factors, including diet and obesity. Gene-environment interactions (G x E) can provide biological insights into the effects of obesity on colorectal cancer risk. Here, we assessed potential genome-wide G x E interactions between body mass index (BMI) and common SNPs for colorectal cancer risk using data from 36,415 colorectal cancer cases and 48,451 controls from three international colorectal cancer consortia (CCFR, CORECT, and GECCO). The G x E tests included the conventional logistic regression using multiplicative terms (one degree of freedom, 1DF test), the two-step EDGE method, and the joint 3DF test, each of which is powerful for detecting G x E interactions under specific conditions. BMI was associated with higher colorectal cancer risk. The two-step approach revealed a statistically significant G x BMI interaction located within the Formin 1/Gremlin 1 (FMN1/GREM1) gene region (rs58349661). This SNP was also identified by the 3DF test, with suggestive statistical significance in the 1DF test. Among participants with the CC genotype of rs58349661, overweight and obesity categories were associated with higher colorectal cancer risk, whereas null associations were observed across BMI categories in those with the TT genotype. Using data from three large international consortia, this study discovered a locus in the FMN1/GREM1 gene region that interacts with BMI in its association with colorectal cancer risk. Further studies should examine the potential mechanisms through which this locus modifies the etiologic link between obesity and colorectal cancer.
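
    For orientation, a conventional logistic model with a multiplicative interaction term can be written as below. This is the textbook form of a 1DF interaction test, not necessarily the exact specification used in the study; the covariate vector C and the coefficient names are assumptions for illustration. Here D is case status, G is the SNP genotype (for example, allele count), and E is BMI.

```latex
\operatorname{logit} \Pr(D = 1 \mid G, E, \mathbf{C})
  = \beta_0 + \beta_G\, G + \beta_E\, E + \beta_{GE}\,(G \times E)
    + \boldsymbol{\gamma}^{\top}\mathbf{C},
\qquad H_0 : \beta_{GE} = 0
```

    Under this form the 1DF test examines the single coefficient on the product term, which is why it has one degree of freedom.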

    The Impact of Large-Scale Social Restriction Phases on the Air Quality Index in Jakarta

    No full text
    We report the results of our study on the impact of Large-Scale Social Restriction (LSSR) phases due to the COVID-19 outbreak on the air quality in Jakarta. Specifically, this study covered the change in the Air Quality Index (AQI) based on five pollutants, PM10, SO2, CO, O3, and NO2, in Jakarta's air before and during LSSR. The AQI data were provided by the Ministry of Environment and Forestry, Indonesia, from January 2019 to December 2020 at five different locations in Jakarta, with data missing for March and September 2020 for unknown reasons. These data were grouped into the period before LSSR, from January 2019 to February 2020, and the period during LSSR, from April 2020 to December 2020. To measure the change in the air quality of Jakarta before and during LSSR, we ran a chi-squared test on the AQI for each location and LSSR phase, as well as a paired one-sided t-test for the seasonal trend. The results showed that, in general, LSSR improved the air quality of Jakarta. The improvement was mainly attributable to reduced transportation activity induced by LSSR. Further analysis of the seasonal pollutant trend showed that the AQI improvement varied across phases due to their unique characteristics.
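
    As a minimal sketch of a paired one-sided t-test of the kind mentioned above, the code below computes the t statistic for the alternative "AQI before is higher than AQI during". The function name, the pairing scheme, and the AQI values are hypothetical; the abstract does not specify how observations were paired.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples.

    A large positive t supports the one-sided alternative
    mean(before - after) > 0, i.e. values dropped after the change.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical AQI readings paired by calendar month (illustrative only).
aqi_before = [80.0, 90.0, 85.0, 95.0]
aqi_during = [70.0, 85.0, 80.0, 85.0]

t, df = paired_t_statistic(aqi_before, aqi_during)
print(f"t = {t:.3f}, df = {df}")
```

    The one-sided p-value is the upper-tail probability of a Student t distribution with `df` degrees of freedom at `t`; the standard library has no t distribution, so a statistics package would be needed for that last step.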

    Aggregating Time Series and Tabular Data in Deep Learning Model for University Students’ GPA Prediction

    No full text
    Current approaches to predicting university students' Grade Point Average (GPA) rely on tabular data as input. Intuitively, adding historical GPA data can help improve the performance of a GPA prediction model. In this study, we present a dual-input deep learning model that simultaneously processes time-series and tabular data to predict student GPA. Our proposed model achieved the best performance among all tested models, with an MSE (Mean Squared Error) of 0.4142 and an MAE (Mean Absolute Error) of 0.418 for GPA on a 4.0 scale. It also has the best R² score, 0.4879, which means it explains more of the variation in students' GPA than the other models.
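
    The three metrics quoted above follow their standard definitions, sketched below in plain Python. The function name and the GPA values are hypothetical and are not data from the study.

```python
def regression_metrics(y_true, y_pred):
    """Return (mse, mae, r2) using the standard definitions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally long")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n       # mean squared error
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    ss_res = sum(e * e for e in errors)        # unexplained variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


# Hypothetical true and predicted GPAs on a 4.0 scale.
true_gpa = [2.0, 3.0, 4.0]
pred_gpa = [2.5, 3.0, 3.5]

mse, mae, r2 = regression_metrics(true_gpa, pred_gpa)
print(f"MSE={mse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

    An R² of 0.4879 therefore means the residual sum of squares is roughly half of the total variation around the mean GPA.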

    Utilizing Mobile-based Deep Learning Model for Managing Video in Knowledge Management System

    No full text
    A Knowledge Management (KM) system is a core feature in facilitating intellectual growth in an organization. However, maintaining a reliable KM system involves numerous difficulties. One of the challenges is managing knowledge materials in video format. A video file contains complex data, which makes it difficult to manage. Without an intelligent system, managing videos for KM requires laborious effort. In this paper, we propose an intelligent framework for KM systems with an embedded deep learning model. The deep learning model alleviates the heavy burden of managing video materials in a KM system. To enhance the agility of the system, the framework uses a mobile-based deep learning model.

    Database System for Storing Tuberculosis Sputum Sample Images as an AI Training Dataset

    No full text
    Indonesia has the second-highest national prevalence of tuberculosis (TB) in the world, after India. This high prevalence can lead to failures in delivering medical treatment to TB patients, a problem exacerbated by the uneven distribution of doctors in Indonesia. To address this issue, an AI system is needed to help doctors screen a large number of patients in a short time. However, developing a robust AI for this purpose requires a large dataset. This study aims to develop a database system for storing TB sputum sample images, which can be used as a dataset to train an AI for TB detection. The developed system can help doctors and health workers manage the images in their daily work. Over time, the stored images can be used as a dataset to train the AI.